1. Abstract and Motivation

Continuing my past work on the Amazon toys dataset, I finally dived into the links stored in the corr_view data, something I had long wanted to do. We have not yet looked into them, although the information behind the links is precious to our analyses. For example, I was interested in the popularity of a certain product, especially relative to its direct competitors, since a category such as “building blocks” would naturally appear to be in higher demand than costumes in terms of the absolute number of listed items.

Therefore, I scraped Amazon in this assignment, which I was not able to do during the summer because, as Mr. Gleason explained, the website relies on JavaScript and only a browser can render it into an actual HTML document. With RSelenium, however, I set up a virtual browser in a Linux environment and crawled the pages through it from the command line. I sampled some links from the “Customers also shopped for” field and built a XXXXX model using Google's TensorFlow. The result was exhilarating that XXXXXX

Insights from analysis:

2. Metadata and Environment Preparation

Since I used TensorFlow in this assignment (through the keras package, a high-level interface for neural networks), which goes beyond the packages introduced in class, please run the chunk below to install and import the library.

# devtools::install_github("rstudio/keras")
library(keras)
# install_keras()

The tables below show the structure of the data frames I obtained from web scraping, in addition to the original Amazon toys dataset.

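toys_prim

| Variable Name | Data Structure | Description |
| --- | --- | --- |
| link_id | char | key/link of the toy |
| product | char | toy’s name |
| star_rating | num | |
| price | num | |
| n_rev | num | number of reviews |

category

| Variable Name | Data Structure | Description |
| --- | --- | --- |
| link_id | char | key/link of the toy |
| category | char | |

review

| Variable Name | Data Structure | Description |
| --- | --- | --- |
| link_id | char | key/link of the toy |
| title | char | review title |
| rating | num | review’s star rating |
| date | Date | |
| author | char | reviewer |
| contents | char | review contents |

corr_view

| Variable Name | Data Structure | Description |
| --- | --- | --- |
| link_id | char | link |
| also_view | char | link of also viewed items |

corr_purchase

| Variable Name | Data Structure | Description |
| --- | --- | --- |
| link_id | char | link |
| also_bought | char | link of also bought items |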

2. Web scraping

Continuing my exploration of the Amazon toys dataset, I decided to exploit the “Customers also shopped for” field because the relations between products are the most valuable information on the interlinked Internet: it is the “endorsement” of one website by another that Google and other search engines employ in their algorithms.

However, Amazon has deployed many anti-crawler techniques. On the kind recommendation of Mr. Gleason, our guest speaker, I used RSelenium, which is very powerful in such situations but also time-consuming: it took me 48 hours to acquire 10K+ entries.

2.1 For a single page - Corr2 Attempt

If we try crawling Amazon’s product page with rvest directly, we get an empty result for many relevant fields, such as “Customers also shopped for”. My suspicion is that Amazon’s product pages use Adobe Flash or JavaScript to load this content after the initial page load, perhaps to save cellular data in transmission, so we have to deceive Amazon with a virtual browser created by RSelenium.

# Customers also shopped for
(link <- corr_view$also_view[1])
## [1] "http://www.amazon.co.uk/Hornby-R8150-Catalogue-2015/dp/B00S9SUUBE"
doc <- read_html(link)
  
# corr name
# #anonCarousel1 .p13n-sc-truncated
doc %>%
  html_nodes(css = "#anonCarousel1 .p13n-sc-truncated")
## {xml_nodeset (0)}
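The empty node set above can be reproduced offline. The sketch below is a toy illustration, not Amazon's actual markup: both HTML strings are hypothetical. It shows that rvest only parses the markup the server sends and never executes the script that would fill the carousel in a real browser.

```r
library(rvest)

# Hypothetical page source as sent by the server: the carousel container is
# empty, and a script (never executed by rvest) would fill it in a browser.
static_html <- '
<html><body>
  <div id="anonCarousel1"></div>
  <script>
    document.getElementById("anonCarousel1").innerHTML =
      "<span class=p13n-sc-truncated>Some toy</span>";
  </script>
</body></html>'

# Hypothetical page source after a browser has run the script
rendered_html <- '
<html><body>
  <div id="anonCarousel1"><span class="p13n-sc-truncated">Some toy</span></div>
</body></html>'

selector <- "#anonCarousel1 .p13n-sc-truncated"
n_static   <- length(html_nodes(read_html(static_html), selector))
n_rendered <- length(html_nodes(read_html(rendered_html), selector))

print(c(static = n_static, rendered = n_rendered))
```

Only the rendered version contains the node, which is why a real browser session is needed before parsing.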

2.2 For a single page - RSelenium

RSelenium enables us to communicate with a remote or local server through commands: navigate to the target website, access certain HTML nodes, scrape them, then move on to the next page.

Here I used a browser set up on my personal computer in a Linux environment.

It is truly amazing to watch how the virtual browser reacts as we send commands to it.

## 1. SET UP RSELENIUM
# Before we access the server from R, we need to start an instance in our local environment by running the line below in Docker.
# docker run --name chrome  -d -p 4445:4444 -p 5901:5900 selenium/standalone-chrome-debug:latest
[Figure: docker1]

# tools used: docker, standalone-chrome, tightVNC
# need to change based on the local ip shown in the machine's docker
IP <- "192.168.99.100"

# set up our virtual browser
remDr <- remoteDriver(remoteServerAddr = IP,
                      port = 4445L,
                      browser = "chrome")

# check the available keys that we can send
RSelenium:::selKeys %>% names()
##  [1] "null"         "cancel"       "help"         "backspace"   
##  [5] "tab"          "clear"        "return"       "enter"       
##  [9] "shift"        "control"      "alt"          "pause"       
## [13] "escape"       "space"        "page_up"      "page_down"   
## [17] "end"          "home"         "left_arrow"   "up_arrow"    
## [21] "right_arrow"  "down_arrow"   "insert"       "delete"      
## [25] "semicolon"    "equals"       "numpad_0"     "numpad_1"    
## [29] "numpad_2"     "numpad_3"     "numpad_4"     "numpad_5"    
## [33] "numpad_6"     "numpad_7"     "numpad_8"     "numpad_9"    
## [37] "multiply"     "add"          "separator"    "subtract"    
## [41] "decimal"      "divide"       "f1"           "f2"          
## [45] "f3"           "f4"           "f5"           "f6"          
## [49] "f7"           "f8"           "f9"           "f10"         
## [53] "f11"          "f12"          "command_meta"
# initiate our browser
remDr$open()
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "chrome"
## 
## $browserVersion
## [1] "77.0.3865.75"
## 
## $chrome
## $chrome$chromedriverVersion
## [1] "77.0.3865.40 (f484704e052e0b556f8030b65b953dce96503217-refs/branch-heads/3865@{#442})"
## 
## $chrome$userDataDir
## [1] "/tmp/.com.google.Chrome.PjP7uc"
## 
## 
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:42127"
## 
## 
## $networkConnectionEnabled
## [1] FALSE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "linux"
## 
## $proxy
## named list()
## 
## $setWindowRect
## [1] TRUE
## 
## $strictFileInteractability
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $unhandledPromptBehavior
## [1] "dismiss and notify"
## 
## $webdriver.remote.sessionid
## [1] "cb3939f9fa64d6f31300e58905d5d5e2"
## 
## $id
## [1] "cb3939f9fa64d6f31300e58905d5d5e2"
remDr$navigate("http://www.google.com")
Sys.sleep(5)
webElem <- remDr$findElement("name", "q")
Sys.sleep(5)
webElem$sendKeysToElement(list("HELLO WORLD"))
Sys.sleep(5)
webElem$sendKeysToElement(list(key = 'enter'))

# # can also play with the browser using the code below
# class(remDr)
# remDr$goBack()
# remDr$goForward()

Using TightVNC, a remote control program, we can watch our server in real time. To do that, we need to download and install the software with administrator rights, launch it, fill in the first field with the IP address and the port (192.168.99.100:: on my computer), click Connect, and enter “secret” as the password, as set by the “standalone-chrome” Docker image.

[Figure: vnc1]

Then I tried again to scrape the first observation of the corr_view table, the same web page as in the rvest attempt.

## 2. SCRAPING

remDr$navigate(link)

Sys.sleep(5)

get_html <- function(remDr){
  remDr$getPageSource() %>%
   .[[1]] %>%
  read_html()
}

doc <- get_html(remDr)

# we finally have something valid
doc %>%
    html_nodes(css = "#bundleV2_feature_div+ .celwidget .a-carousel-initialized") %>%
    html_text() %>%
    str_trim()
## [1] "Sponsored products related to this itemPage 1 of 3Start overPage 1 of 3            Previous page of related Sponsored Products                                                                                                                                                                                                        (function(f) {f(window.P._namespace(\"FirebirdSpRendering\"));}(function(P) {        P.when('SponsoredProductsViewability').execute(function(SV) {            SV.loadImagePixel(\"/gp/sponsored-products/logging/log-action.html?qualifier=1569469652&id=4189520432981751&widgetName=sp_detail&adId=20022974795501&eventType=2&adIndex=0\");        });    }));(function(f) {f(window.P._namespace(\"FirebirdSpRendering\"));}(function(P) {    P.when('A', 'SponsoredProductsViewability').execute(function(A, SV) {        SV.registerViewTrackingElement(A.$(\"#sp_detail_B07N88MGS3\"), \"sp_detail\");    });}));                        Feedback                                                                        Hornby R1234 Harry Potter Hogwarts Express Train Set                                                        £160.00                                                                                                                                                                                                                                                                                                                                                                                                                                                                (function(f) {f(window.P._namespace(\"FirebirdSpRendering\"));}(function(P) {        P.when('SponsoredProductsViewability').execute(function(SV) {            SV.loadImagePixel(\"/gp/sponsored-products/logging/log-action.html?qualifier=1569469652&id=4189520432981751&widgetName=sp_detail&adId=20025616578903&eventType=2&adIndex=1\");        });    }));(function(f) 
{f(window.P._namespace(\"FirebirdSpRendering\"));}(function(P) {    P.when('A', 'SponsoredProductsViewability').execute(function(A, SV) {        SV.registerViewTrackingElement(A.$(\"#sp_detail_B075DHJXN2\"), \"sp_detail\");    });}));                        Feedback                                                                        Thomas and Friends FHM17 Wood Percy, Thomas The Tank Engine Wooden Toy Engine, Toy ...                                        16                                                        £9.50                                                                                                                                                                                                                                                                                                                                                                                                                                                                (function(f) {f(window.P._namespace(\"FirebirdSpRendering\"));}(function(P) {        P.when('SponsoredProductsViewability').execute(function(SV) {            SV.loadImagePixel(\"/gp/sponsored-products/logging/log-action.html?qualifier=1569469652&id=4189520432981751&widgetName=sp_detail&adId=20023744421301&eventType=2&adIndex=2\");        });    }));(function(f) {f(window.P._namespace(\"FirebirdSpRendering\"));}(function(P) {    P.when('A', 'SponsoredProductsViewability').execute(function(A, SV) {        SV.registerViewTrackingElement(A.$(\"#sp_detail_B078K44BP8\"), \"sp_detail\");    });}));                        Feedback                                                                        LEGO 60197 City Trains Passenger Train Set, Battery Powered Engine, RC Bluetooth Co...                                        
42                                                            Limited time deal                                                        £89.99                                                                                                                                                List: £119.99 (25% off)                                                                                                                                                                                                            Next page of related Sponsored Products                             Ad feedback"

Exciting! We got something valid now!

Then I continued my work by formatting multiple fields into tidy tables. It is worth mentioning that I wrapped the table construction in tryCatch(), since errors are a common occurrence in web scraping.

# Initialization for the loop

# create a function for multiple field
get_product <- function(link, remDr){
  
  # The dataset originates from the UK's Amazon
  base_link <- "http://www.amazon.co.uk"
  
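  # Initialization for the function

  # navigate first, then allow buffer time for the page to load,
  # so that the parsed HTML belongs to `link` and not to the previous page
  remDr$navigate(link)
  Sys.sleep(5)
  doc <- get_html(remDr)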

  # regard link as the id of an observation -- product (toy)
  
  # product_name
  # #productTitle
  product_name <- doc %>%
    html_nodes("#productTitle") %>%
    html_text() %>%
    str_trim()
  
  # brand
  # #bylineInfo
  brand <- doc %>%
    html_nodes("#bylineInfo") %>%
    html_text() %>%
    str_trim()
  
  # star rating
  # .arp-rating-out-of-text
  product_star <- doc %>%
    html_nodes(".arp-rating-out-of-text") %>%
    html_text() %>%
    str_trim() %>%
    # convert string to number
    str_remove_all(" out of 5 stars") %>%
    as.numeric()
  
  # price
  # #priceblock_ourprice
  product_price <- doc %>%
    html_nodes("#priceblock_ourprice") %>%
    html_text() %>%
    str_trim() %>%
    str_remove_all("£") %>%
    as.numeric()
  
  # category
  # #wayfinding-breadcrumbs_feature_div .a-size-small
  category <- doc %>%
    html_nodes("#wayfinding-breadcrumbs_feature_div .a-size-small") %>%
    html_text() %>%
    str_trim() %>%
    str_remove_all("\n[[:space:]]*")
  
  # n_rev
  # #prodDetails .a-size-small .a-link-normal
  n_rev <- doc %>%
    html_nodes("#prodDetails .a-size-small .a-link-normal") %>%
    html_text() %>%
    str_trim() %>%
    str_remove_all(" customer reviews") %>%
    as.numeric()
  
  ######
  # also_view
  # #anonCarousel1 .a-link-normal
  partial_also_view <- doc %>%
    html_nodes("#anonCarousel1 .a-link-normal.a-text-normal") %>%
    html_attr("href")
  # resolve relative hrefs against the base URL; absolute links stay unchanged
  also_view <- xml2::url_absolute(partial_also_view, base_link)
  
  ######
  # also_bought 
  # #anonCarousel2 .a-link-normal
  partial_also_bought <- doc %>%
    html_nodes('#anonCarousel2 .a-link-normal.a-text-normal') %>%
    html_attr("href")
  # resolve relative hrefs against the base URL; absolute links stay unchanged
  also_bought <- xml2::url_absolute(partial_also_bought, base_link)
  
  ########
  ## review
  
  # rev_title
  # #cm-cr-dp-review-list .a-text-bold span
  rev_title <- doc %>%  
    html_nodes(".review-title-content.a-text-bold") %>%
    html_text() %>%
    str_trim()
  
  # rev_author
  # .a-profile-name
  rev_author <- doc %>%  
    html_nodes(".a-profile-name") %>%
    html_text() %>%
    str_trim()
  
  # rev_date
  # .review-date
  rev_date <- doc %>%  
    html_nodes(".review-date") %>%
    html_text() %>%
    str_trim() %>%
    as.Date(format = "%d %B %Y")
  
  ## hard, capture pop up content
  # rev_rating
  # #cm-cr-dp-review-list .a-icon-alt
  rev_star <- doc %>%
    html_nodes(".cr-translate-cta+ .a-row") %>%
    html_nodes(".a-icon-alt") %>%
    html_text() %>%
    str_trim() %>%
    # convert string to number
    str_remove_all(" out of 5 stars") %>%
    as.numeric()
  
  rev_star2 <-  doc %>%
    html_nodes("#cm-cr-cmps-review-list .celwidget") %>%
    html_nodes(".a-icon-alt") %>%
    html_text() %>%
    str_trim() %>%
    # convert string to number
    str_remove_all(" out of 5 stars") %>%
    as.numeric()
    
  rev_star <- append(rev_star, rev_star2)
  # rev_contents
  # .a-expander-partial-collapse-content span
  rev_contents <- doc %>%  
    # .cr-widget-CrossMarketplaceSharing , .card-padding
    html_nodes(".cm_cr_grid_center_container") %>%
    html_nodes(".a-expander-partial-collapse-content > span") %>%
    html_text() %>%
    str_trim()
  
  ###################################
  ## create relational tables
  tryCatch({
  # 1.primary table
  toys_prim <- tibble(
    link_id = link,
    brand = brand,
    product = product_name,
    star_rating = product_star,
    price = product_price,
    n_rev = n_rev
  )
  
  # 2.category
  category <- tibble(
    link_id = link,
    category = category
  )
  
  # 3. review
  review <- tibble(
    link_id = link,
    title = rev_title,
    rating = rev_star,
    date = rev_date,
    author = rev_author,
    contents = rev_contents
  )
  
  # 4. corr_view
  corr_view <- tibble(
    link_id = link,
    also_view = also_view
  )
  
  # 5. corr_purchase
  corr_purchase <- tibble(
    link_id = link,
    also_bought = also_bought
  )
  })
  out <- list(
    "toys_prim" = toys_prim, 
    "category" = category, 
    "review" = review, 
    "corr_view" = corr_view, 
    "corr_purchase" = corr_purchase)
  
  return(out)
}

(my_toy <- get_product(link, remDr))
## $toys_prim
## # A tibble: 1 x 6
##   link_id                        brand product      star_rating price n_rev
##   <chr>                          <chr> <chr>              <dbl> <dbl> <dbl>
## 1 http://www.amazon.co.uk/Hornb~ Horn~ Hornby Cata~         4.3  17.2    24
## 
## $category
## # A tibble: 1 x 2
##   link_id                            category                              
##   <chr>                              <chr>                                 
## 1 http://www.amazon.co.uk/Hornby-R8~ Hobbies›Model Building›Pre-Built & Di~
## 
## $review
## # A tibble: 8 x 6
##   link_id          title       rating date       author   contents         
##   <chr>            <chr>        <dbl> <date>     <chr>    <chr>            
## 1 http://www.amaz~ Nothing bu~      5 2015-12-09 S. Kerr  Bought this for ~
## 2 http://www.amaz~ Excellent ~      5 2015-10-22 M. G. M~ Beautifully phot~
## 3 http://www.amaz~ Catalogue ~      2 2015-12-04 Mrs E C~ I ordered 2015 n~
## 4 http://www.amaz~ Five Stars       5 2016-07-17 Amazon ~ excellent catalo~
## 5 http://www.amaz~ i like the~      5 2016-03-03 charles~ wanted the catal~
## 6 http://www.amaz~ Very good ~      4 2015-08-07 sheldon~ Gave me an overv~
## 7 http://www.amaz~ Well chuff~      5 2015-12-02 Thomas ~ As good as usual~
## 8 http://www.amaz~ Five Stars       5 2015-12-23 Katheri~ Very good my son~
## 
## $corr_view
## # A tibble: 1 x 2
##   link_id                                               also_view          
##   <chr>                                                 <chr>              
## 1 http://www.amazon.co.uk/Hornby-R8150-Catalogue-2015/~ http://www.amazon.~
## 
## $corr_purchase
## # A tibble: 1 x 2
##   link_id                                              also_bought         
##   <chr>                                                <chr>               
## 1 http://www.amazon.co.uk/Hornby-R8150-Catalogue-2015~ http://www.amazon.c~
# remDr$close()
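Note that tryCatch() without an error handler simply lets the error propagate. One possible way to keep a long crawl running is to wrap the whole scraper call and return NULL on failure. This is a sketch under assumptions: safe_scrape() and fake_scraper() are hypothetical names introduced here for illustration, with fake_scraper() standing in for get_product().

```r
# Hypothetical wrapper: run any scraper on a link and return NULL
# (with a message) instead of stopping when the scraper throws an error.
safe_scrape <- function(scrape_fun, link, ...) {
  tryCatch(
    scrape_fun(link, ...),
    error = function(e) {
      message("Skipping ", link, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in scraper for illustration only: fails on "expired" links
fake_scraper <- function(link) {
  if (grepl("expired", link)) stop("page not found")
  list(link_id = link)
}

ok  <- safe_scrape(fake_scraper, "http://example.com/toy")
bad <- safe_scrape(fake_scraper, "http://example.com/expired")
```

In the crawl itself this could be called as safe_scrape(get_product, link, remDr). Since bind_rows() ignores NULL inputs, a skipped page would simply add no rows.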

In fact, we can follow the hyperlinks on Amazon indefinitely, as long as a page provides “Customers also shopped for” or “Customers who bought this item also bought” items. The network of hyperlinks essentially forms a directed graph to which we can apply graph theory. That is the idea behind how Google built its search ranking, and it may be of interest for future analyses.
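To make the graph view concrete, here is a small base R sketch on a made-up edge list; the three nodes A, B, and C are hypothetical and stand in for product links. It counts incoming links and runs a simplified PageRank-style power iteration, which is only an illustration of the idea and was not part of this assignment.

```r
# Hypothetical edge list: `from` is a product page, `to` is an "also viewed" link
edges <- data.frame(
  from = c("A", "A", "B", "C", "C"),
  to   = c("B", "C", "C", "A", "B"),
  stringsAsFactors = FALSE
)

nodes <- sort(unique(c(edges$from, edges$to)))
n <- length(nodes)

# adjacency matrix: adj[i, j] = 1 if page i links to page j
adj <- matrix(0, n, n, dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1

# in-degree: how often each product is recommended by others
in_degree <- colSums(adj)

# Simplified PageRank by power iteration (damping factor 0.85).
# Assumes every node has at least one outgoing link.
d <- 0.85
trans <- adj / rowSums(adj)          # row-stochastic transition matrix
pr <- rep(1 / n, n)
for (k in 1:100) {
  pr <- (1 - d) / n + d * as.vector(pr %*% trans)
}
names(pr) <- nodes

print(in_degree)
print(round(pr, 3))
```

On the scraped data, the corr_view table (link_id, also_view) already has the shape of such an edge list.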

2.3 For multiple pages using for loop

We use a for loop instead of map() with a wrapper (or other vectorized computation) because it is easier to debug:

- the loop breaks exactly where the error happens, so we know which link failed;
- map() simply fails as a whole, and we have to start over again every time.

Before the iteration, we need a helper that appends each new result to the accumulated tables with bind_rows().

# write 
rbind_product <- function(my_product, new_product) {
  my_prim <- bind_rows(my_product$toys_prim, new_product$toys_prim)
  my_category <- bind_rows(my_product$category, new_product$category)
  my_rev <- bind_rows(my_product$review, new_product$review)
  my_corr_view <- bind_rows(my_product$corr_view, new_product$corr_view)
  my_corr_purchase <- bind_rows(my_product$corr_purchase, new_product$corr_purchase)
  
  out <- list(
    "toys_prim" = my_prim, 
    "category" = my_category, 
    "review" = my_rev, 
    "corr_view" = my_corr_view, 
    "corr_purchase" = my_corr_purchase)
  
  return(out)
}

# test it
new_toy <- get_product(corr_view$also_view[2], remDr)
rbind_product(my_toy, new_toy)
## $toys_prim
## # A tibble: 2 x 6
##   link_id                        brand  product     star_rating price n_rev
##   <chr>                          <chr>  <chr>             <dbl> <dbl> <dbl>
## 1 http://www.amazon.co.uk/Hornb~ Hornby Hornby Cat~         4.3  17.2    24
## 2 http://www.amazon.co.uk/Hornb~ Hornby Hornby Cat~         4.3  17.2    24
## 
## $category
## # A tibble: 2 x 2
##   link_id                              category                            
##   <chr>                                <chr>                               
## 1 http://www.amazon.co.uk/Hornby-R815~ Hobbies›Model Building›Pre-Built & ~
## 2 http://www.amazon.co.uk/Hornby-Book~ Hobbies›Model Building›Pre-Built & ~
## 
## $review
## # A tibble: 16 x 6
##    link_id           title       rating date       author  contents        
##    <chr>             <chr>        <dbl> <date>     <chr>   <chr>           
##  1 http://www.amazo~ Nothing bu~      5 2015-12-09 S. Kerr Bought this for~
##  2 http://www.amazo~ Excellent ~      5 2015-10-22 M. G. ~ Beautifully pho~
##  3 http://www.amazo~ Catalogue ~      2 2015-12-04 Mrs E ~ I ordered 2015 ~
##  4 http://www.amazo~ Five Stars       5 2016-07-17 Amazon~ excellent catal~
##  5 http://www.amazo~ i like the~      5 2016-03-03 charle~ wanted the cata~
##  6 http://www.amazo~ Very good ~      4 2015-08-07 sheldo~ Gave me an over~
##  7 http://www.amazo~ Well chuff~      5 2015-12-02 Thomas~ As good as usua~
##  8 http://www.amazo~ Five Stars       5 2015-12-23 Kather~ Very good my so~
##  9 http://www.amazo~ Nothing bu~      5 2015-12-09 S. Kerr Bought this for~
## 10 http://www.amazo~ Excellent ~      5 2015-10-22 M. G. ~ Beautifully pho~
## 11 http://www.amazo~ Catalogue ~      2 2015-12-04 Mrs E ~ I ordered 2015 ~
## 12 http://www.amazo~ Five Stars       5 2016-07-17 Amazon~ excellent catal~
## 13 http://www.amazo~ i like the~      5 2016-03-03 charle~ wanted the cata~
## 14 http://www.amazo~ Very good ~      4 2015-08-07 sheldo~ Gave me an over~
## 15 http://www.amazo~ Well chuff~      5 2015-12-02 Thomas~ As good as usua~
## 16 http://www.amazo~ Five Stars       5 2015-12-23 Kather~ Very good my so~
## 
## $corr_view
## # A tibble: 3 x 2
##   link_id                              also_view                           
##   <chr>                                <chr>                               
## 1 http://www.amazon.co.uk/Hornby-R815~ http://www.amazon.co.uk             
## 2 http://www.amazon.co.uk/Hornby-Book~ http://www.amazon.co.ukhttps://www.~
## 3 http://www.amazon.co.uk/Hornby-Book~ http://www.amazon.co.ukhttps://www.~
## 
## $corr_purchase
## # A tibble: 2 x 2
##   link_id                                               also_bought        
##   <chr>                                                 <chr>              
## 1 http://www.amazon.co.uk/Hornby-R8150-Catalogue-2015/~ http://www.amazon.~
## 2 http://www.amazon.co.uk/Hornby-Book-Model-Railways-E~ http://www.amazon.~
# test it
new_toy <- get_product(corr_view$also_view[50], remDr)
rbind_product(my_toy, new_toy)
## $toys_prim
## # A tibble: 1 x 6
##   link_id                        brand product      star_rating price n_rev
##   <chr>                          <chr> <chr>              <dbl> <dbl> <dbl>
## 1 http://www.amazon.co.uk/Hornb~ Horn~ Hornby Cata~         4.3  17.2    24
## 
## $category
## # A tibble: 2 x 2
##   link_id                            category                              
##   <chr>                              <chr>                                 
## 1 http://www.amazon.co.uk/Hornby-R8~ Hobbies›Model Building›Pre-Built & Di~
## 2 http://www.amazon.co.uk/Hornby-Ra~ Reference›Transport›Railways          
## 
## $review
## # A tibble: 16 x 6
##    link_id         title        rating date       author   contents        
##    <chr>           <chr>         <dbl> <date>     <chr>    <chr>           
##  1 http://www.ama~ Nothing but~      5 2015-12-09 S. Kerr  Bought this for~
##  2 http://www.ama~ Excellent r~      5 2015-10-22 M. G. M~ Beautifully pho~
##  3 http://www.ama~ Catalogue 2~      2 2015-12-04 Mrs E C~ I ordered 2015 ~
##  4 http://www.ama~ Five Stars        5 2016-07-17 Amazon ~ excellent catal~
##  5 http://www.ama~ i like the ~      5 2016-03-03 charles~ wanted the cata~
##  6 http://www.ama~ Very good a~      4 2015-08-07 sheldon~ Gave me an over~
##  7 http://www.ama~ Well chuffe~      5 2015-12-02 Thomas ~ As good as usua~
##  8 http://www.ama~ Five Stars        5 2015-12-23 Katheri~ Very good my so~
##  9 http://www.ama~ Lonely Plan~      4 2013-03-10 Scott    "Very nicely pr~
## 10 http://www.ama~ Everything ~      5 2014-08-03 S Sparr~ Bought this as ~
## 11 http://www.ama~ Very good b~      5 2018-02-11 Monty    Excellent book,~
## 12 http://www.ama~ It is a goo~      2 2017-12-04 Alastai~ It is a good bo~
## 13 http://www.ama~ Excellelent~      4 2014-03-30 Mr. Nic~ This is a well ~
## 14 http://www.ama~ Pristine co~      5 2017-12-15 Angela ~ Really carefull~
## 15 http://www.ama~ Hornby book~      5 2014-05-22 Mr R St~ I bought this v~
## 16 http://www.ama~ A must buy        5 2017-09-15 Liam     A must buy for ~
## 
## $corr_view
## # A tibble: 2 x 2
##   link_id                                               also_view          
##   <chr>                                                 <chr>              
## 1 http://www.amazon.co.uk/Hornby-R8150-Catalogue-2015/~ http://www.amazon.~
## 2 http://www.amazon.co.uk/Hornby-Railroad-Rothery-Indu~ http://www.amazon.~
## 
## $corr_purchase
## # A tibble: 6 x 2
##   link_id                            also_bought                           
##   <chr>                              <chr>                                 
## 1 http://www.amazon.co.uk/Hornby-R8~ http://www.amazon.co.uk               
## 2 http://www.amazon.co.uk/Hornby-Ra~ http://www.amazon.co.uk/Newcomers-Gui~
## 3 http://www.amazon.co.uk/Hornby-Ra~ http://www.amazon.co.uk/Hornby-R8156-~
## 4 http://www.amazon.co.uk/Hornby-Ra~ http://www.amazon.co.uk/Planning-Desi~
## 5 http://www.amazon.co.uk/Hornby-Ra~ http://www.amazon.co.uk/Model-Railway~
## 6 http://www.amazon.co.uk/Hornby-Ra~ http://www.amazon.co.uk/Hornby-Book-S~
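Binding after every page, as rbind_product() does, copies the accumulated tables on each iteration. A common alternative is to collect the per-page results in a list and bind once at the end. The sketch below uses base R and a hypothetical scrape_stub() in place of get_product(), so it can run without a browser.

```r
# Stand-in for one scraped page (hypothetical data)
scrape_stub <- function(i) {
  data.frame(link_id = paste0("link_", i), n_rev = i, stringsAsFactors = FALSE)
}

links <- 1:5

# pre-allocate the list, fill it inside the loop, bind once afterwards
results <- vector("list", length(links))
for (i in seq_along(links)) {
  results[[i]] <- scrape_stub(links[i])
}
all_pages <- do.call(rbind, results)

print(all_pages)
```

The for loop is kept, so the debugging advantage described above still holds; only the binding step moves out of the loop.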

It appears that we acquired some blank pages at this step because they had expired or were formatted in a different way, so making the scraper more robust is a natural next step. Though this compromises the completeness of our dataset, we can easily get rid of empty observations using semi_join(), keeping only those with primary information stored in the toys_prim table. The same issue may occur when scraping the whole dataset, but that is acceptable.
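As a small illustration of that clean-up step, the sketch below applies semi_join() to two made-up tables. The values are hypothetical; only the column names follow the tables described earlier.

```r
library(dplyr)

# Hypothetical scraped tables: link "c" has reviews but no primary record
toys_prim <- tibble(link_id = c("a", "b"), price = c(17.2, 9.5))
review    <- tibble(link_id = c("a", "a", "c"), rating = c(5, 4, 3))

# keep only reviews whose link_id also appears in toys_prim
review_clean <- review %>%
  semi_join(toys_prim, by = "link_id")

print(review_clean)
```

Unlike inner_join(), semi_join() only filters rows and adds no columns from toys_prim.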

# initialization for the loop
i <- 0
link <- corr_view$also_view[1]
remDr$navigate(link)
my_toy <- get_product(link, remDr)

As it was too time-consuming to scrape the data again, I did not run the chunk below while knitting the report (the GIF I provided shows how it works), and a .RData file containing all my results is included in my submission as a supplement.

# my_toy <- get_product(corr_view$also_view[1], remDr)
t1 <- Sys.time()

# for (link in corr_view$also_view[-1]) {
# the indexing is inclusive

# 20190924 1:21 AM i = 21797
# 28654
# 19165
# 29244 30769
# 12007/36758
# note the parentheses: i+1:n parses as i + (1:n) and would run past the last row
for (link in corr_view$also_view[(i + 1):nrow(corr_view)]) {
  # counting, provide location for debugging
  i <- i + 1

  new_toy <- get_product(link, remDr)
  my_toy <- rbind_product(my_toy, new_toy)
      
}
t2 <- Sys.time()
writeLines(paste("Time elapsed:", format(round(difftime(t2, t1, units = "hours"), 2))))
print(i)

# remDr$close()
load("my_toy.RData")
[Figure: scraping]
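Because a crawl of this length is likely to be interrupted, it may help to save progress periodically and resume from the last checkpoint instead of editing i by hand. The following is a minimal sketch, assuming a hypothetical checkpoint file in the temporary directory and a stand-in toupper() call where get_product() would go.

```r
# Hypothetical checkpoint location and link list
checkpoint_file <- file.path(tempdir(), "scrape_checkpoint.rds")
links <- paste0("link_", 1:10)

# resume from a previous run if a checkpoint exists
results <- list()
i <- 1
if (file.exists(checkpoint_file)) {
  state <- readRDS(checkpoint_file)
  results <- state$results
  i <- state$next_index
}

while (i <= length(links)) {
  results[[i]] <- toupper(links[i])   # stand-in for get_product(links[i], remDr)
  i <- i + 1
  # save the state after every 5 completed links
  if ((i - 1) %% 5 == 0) {
    saveRDS(list(results = results, next_index = i), checkpoint_file)
  }
}
```

The interval of 5 is arbitrary; for a real crawl a larger interval would reduce disk writes at the cost of redoing more pages after a crash.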

3. Analyses

With the harvest from web scraping in the last chapter, it is time to extract insights from the new data and from its combination with the original toys dataset. As the assignment is themed “web scraping”, I collected as much useful information as possible, although after exploration I did not use all of it in the analyses.

3.1 Tidying the data

First things first: because plenty of product pages had expired and our crawler is not robust to every page layout, the 48-hour scrape acquired not all of the links but only 32.66% of them (12007/36758 = 32.66%). That means we need to exclude incomplete entries before diving into the analyses.

corr_view <- corr_view %>%
  group_by(uniq_id) %>%
  mutate(cv_rank = row_number()) %>%
  ungroup() %>%
  group_by(also_view) %>%
  mutate(cv_id = row_number()) %>%
  ungroup()

# delete the first repeated row
toys_prim_cv <- my_toy$toys_prim[2:12008,] %>%
  group_by(link_id) %>%
  mutate(cv_id = row_number()) %>%
  ungroup()

cv_full <- toys_prim %>%
  select(uniq_id, price, num_rev, avg_rating) %>%
  inner_join(corr_view, by = "uniq_id") %>% rename(link_id = also_view) %>%
  inner_join(toys_prim_cv, by = c("link_id", "cv_id")) %>%
  mutate(price_prim_corr = price.x - price.y) 

reg <- cv_full %>%
  select(-price.x, -price.y, -link_id, -uniq_id, -product, -cv_id) %>%
  rename(prim_rating = avg_rating, corr_rating = star_rating, prim_nrev = num_rev, corr_nrev = n_rev) %>%
  na.omit()

3.2 King of “also shopped for” – Yu-Gi-Oh Premium Gold Booster Pack - 15 Cards

Big companies such as Konami, Tamiya, Pampers, and Hornby dominate the ranking of the most “also viewed” items, and the Japanese ones among them (Konami and Tamiya) stand out. Drawing on my own experience, I would say that Japanese toy companies usually do a great job of building out their product matrix, either by launching expandable lines with many components or by differentiating the same product to cover every niche.

top30_cv <- cv_full %>%
  count(product) %>%
  arrange(desc(n)) %>%
  top_n(30, n) %>%
  mutate(product2 = fct_reorder(product, n)) 

  # plot it
top30_cv %>%
  ggplot(aes(product2, n)) +
  geom_col(alpha = 0.65, width = 0.618, aes(fill = n)) +
  # trim the axis labels, it was too long
  scale_x_discrete(label=function(x) paste0(strtrim(x, 30), "...")) +
  coord_flip() +
  theme(axis.ticks = element_line(color = "white"), legend.position = "none") +
  geom_text(aes(y = n, label = n), position = position_dodge(0.9), hjust = -0.5) +
  # scale_y_continuous(limit = c(-10, 5000), breaks = seq(0, 5000, by = 1000)) +
  labs(title = "Top 30 most frequently \"also viewed\" products",
       x = "product",
       y = "frequency") +
  scale_fill_gradient(low = "#AFEEEE", high = "#9400D3")

3.3 The correlated visit majorly comes from products of the same category

We should notice that on Amazon the “also viewed” field is often occupied by products of the same brand or manufacturer, and the phenomenon applies increasingly to famous brands. We should dig into these correlations to determine whether the hypothesis holds.

# category
cat_cv <- my_toy$category %>%
  distinct() 

cat_cv_first <- cat_cv %>%
  separate_rows(category, sep = "›") %>%
  group_by(link_id) %>%
  dplyr::summarize(cat_cv_f = first(category))

source_cat <- cv_full %>%
  filter(product %in% top30_cv$product) %>%
  left_join(distinct(cat_cv_first, link_id, .keep_all = TRUE), by = "link_id") %>%
  left_join(select(category, uniq_id, cat1),
            by = "uniq_id") %>%
  mutate(same_cat = (cat1 == cat_cv_f)) %>%
  group_by(product, same_cat) %>%
  count(product) %>%
  ungroup()


source_cat %>%
  na.omit() %>%
  ggplot(aes(product, n, fill = same_cat, label = n)) +
    geom_col(width = 0.618, alpha = 0.8, position = "fill") +
    theme(axis.ticks = element_line(color = "white")) +
    labs(title = "How many were from the same category") +
  scale_x_discrete(label=function(x) paste0(strtrim(x, 30), "...")) +
  coord_flip() +
  labs(y = "proportion")

3.4 The position of products in the “also view” field is random

At first I suspected that the position of a related product might depend on features such as the number of reviews, the star rating, and the price, because that information is shown directly on a product page without clicking the hyperlink. Personally, I am sometimes lured by the recommended “also viewed” items and eventually converted by a seller other than the one I initially chose. Therefore, it was reasonable to hypothesize that sellers or manufacturers would compete for this zone. A common and mature strategy adopted by Consumer Packaged Goods companies like P&G is to position products directly against their comparable competitors, because this is a game of “beat or be beaten” with no middle ground.

However, the neural network trained with TensorFlow suggested that these features have no impact on the position rank of the also-viewed items. My explanation is still that consumer behavior is so multifaceted that analyzing it demands higher-dimensional data.

Here, I employed a sequential model with two densely connected hidden layers and an output layer that returns a single value. The model-building steps are wrapped in a function, build_model. I chose the mean squared error (MSE) as the objective function.
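
As a minimal sketch of the two quantities the model reports, the MSE used as the loss and the mean absolute error used as the metric can be computed by hand in base R. The helper names mse and mae and the numbers below are made up for illustration and are not taken from my data.

```r
# Toy illustration only: these vectors are invented, not drawn from `reg`.
mse <- function(actual, predicted) mean((actual - predicted)^2)
mae <- function(actual, predicted) mean(abs(actual - predicted))

actual    <- c(1, 2, 3, 4)    # hypothetical cv_rank labels
predicted <- c(1.5, 2, 2, 6)  # hypothetical model outputs

mse(actual, predicted)  # 1.3125
mae(actual, predicted)  # 0.875
```

Because the errors are squared, the single large miss (4 vs. 6) dominates the MSE, while the MAE stays on the same scale as the rank itself, which is why I read the MAE in the plots.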

Terminology explanation:

- epoch: The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.
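
To make the definition concrete, the sketch below counts how many weight updates a given number of epochs implies. The helper n_updates is hypothetical (it is not part of keras), the sample size of 1000 is invented, and the batch size of 32 is an assumption on my part: I believe it is the Keras default when batch_size is not set.

```r
# Hypothetical helper, not a keras function: one epoch is one full pass over
# the training samples, processed in mini-batches of `batch_size` rows.
n_updates <- function(n_samples, epochs, batch_size = 32) {
  steps_per_epoch <- ceiling(n_samples / batch_size)
  epochs * steps_per_epoch
}

n_updates(n_samples = 1000, epochs = 1)    # 32 updates in a single pass
n_updates(n_samples = 1000, epochs = 500)  # 16000 updates over 500 passes
```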

### 1 SPLIT THE DATA INTO TRAINING AND TEST SETS

## 80% of the sample size
smp_size <- floor(0.8 * nrow(reg))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(reg)), size = smp_size)

train_reg <- reg[train_ind, ]
test_reg <- reg[-train_ind, ]

# Separate the predictors from the label (cv_rank)
train <- train_reg %>% select(-cv_rank)
train_labels <- train_reg$cv_rank

test <- test_reg %>% select(-cv_rank)
test_labels <- test_reg$cv_rank


### 2 SCALE
# Test data is *not* used when calculating the mean and std.

# Normalize training data
train <- scale(train) 

# Use means and standard deviations from training set to normalize test set
col_means_train <- attr(train, "scaled:center") 
col_stddevs_train <- attr(train, "scaled:scale")
test <- scale(test, center = col_means_train, scale = col_stddevs_train)

train[1, ] # First training sample, normalized
##       prim_nrev     prim_rating     corr_rating       corr_nrev 
##      -0.2381505      -0.5317486       0.8722719      -0.4476890 
## price_prim_corr 
##      -0.3702088
test[1, ] # First testing sample, normalized
##       prim_nrev     prim_rating     corr_rating       corr_nrev 
##       0.1955015      -2.2117992      -0.9048427       1.2089260 
## price_prim_corr 
##      -0.3706690
### 3 CREATE MODEL

build_model <- function() {
  
  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu",
                input_shape = dim(train)[2]) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1)
  
  model %>% compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list("mean_absolute_error")
  )
  
  model
}

model <- build_model()
model %>% summary()
## Model: "sequential"
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense (Dense)                    (None, 64)                    384         
## ___________________________________________________________________________
## dense_1 (Dense)                  (None, 64)                    4160        
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 1)                     65          
## ===========================================================================
## Total params: 4,609
## Trainable params: 4,609
## Non-trainable params: 0
## ___________________________________________________________________________
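
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. The sketch below assumes the five input features shown in the normalized sample above; dense_params is a hypothetical helper of my own, not a keras function.

```r
# Hypothetical helper: parameters of a fully connected layer
# = weights (n_in * n_out) + biases (n_out).
dense_params <- function(n_in, n_out) n_in * n_out + n_out

# 5 input features -> 64 units -> 64 units -> 1 output
layer_sizes <- c(5, 64, 64, 1)
params <- dense_params(head(layer_sizes, -1), tail(layer_sizes, -1))

params       # 384 4160 65
sum(params)  # 4609
```
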
### 4 TRAIN THE MODEL

# Display training progress by printing a single dot for each completed epoch.
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 80 == 0) cat("\n")
    cat(".")
  }
)   

# Train for 500 epochs by default; increase this if the model has not converged by then.
epochs <- 500

# Fit the model and store training stats
history <- model %>% fit(
  train,
  train_labels,
  epochs = epochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(print_dot_callback)
)
## 
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ....................
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
  coord_cartesian(ylim = c(0, 2))

The model converges very quickly, so we can stop the training early once a set number of epochs elapses without any improvement in the validation loss.

early_stop <- callback_early_stopping(monitor = "val_loss", patience = 20)

model <- build_model()
history <- model %>% fit(
  train,
  train_labels,
  epochs = epochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(early_stop, print_dot_callback)
)
## 
## .................................
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
  coord_cartesian(xlim = c(0, 50), ylim = c(0, 5))
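
The patience rule behind the early stopping above can be sketched in base R. This is only my simplified reading of the idea, not the keras implementation: stop_epoch is a hypothetical helper, and the validation losses are invented.

```r
# Hypothetical helper mimicking a patience rule: stop once `patience`
# consecutive epochs pass without beating the best validation loss so far.
stop_epoch <- function(val_loss, patience) {
  best <- Inf
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]  # new best: reset the counter
      wait <- 0
    } else {
      wait <- wait + 1
      if (wait >= patience) return(epoch)
    }
  }
  length(val_loss)  # never triggered: training runs to the last epoch
}

val_loss <- c(5, 3, 2, 2.1, 2.2, 2.05, 2.3)  # made-up validation losses
stop_epoch(val_loss, patience = 3)   # 6: best loss at epoch 3, then 3 epochs without improvement
stop_epoch(val_loss, patience = 20)  # 7: patience is never exhausted
```

A larger patience tolerates noisier validation curves at the cost of training a few more epochs than strictly needed.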

4. Conclusion

The work this time leaves me with valuable techniques and data to explore.

RSelenium gave us the ability to crawl information from more kinds of websites, although it was very slow compared to rvest. I expect multi-threaded programming to be a possible technical improvement.

With this preparation, I found that some companies make an effort to take over the “also view” area rather than leave it to competitors. Besides, even the neural network analysis using TensorFlow could not explain the layout of the field, since the available information is limited.

The analysis I most wish I had time for was a network analysis using an algorithm like PageRank, as an extension of the comparison between categories that I proposed in Case Study 1. I will consider this Rmd a living document and polish it in the future.

Acknowledgement

I would like to give my personal thanks to Mr. Gleason, who introduced me to two powerful tools that solved exactly the problems I encountered during the one-month web-scraping tasks. SelectorGadget helped me find the XPath easily and visually, and RSelenium created a virtual browser for me so that I could scale my “copy-and-paste” to thousands of pages.